64 research outputs found

    Genome data analysis, protein function and structure prediction by machine learning techniques

    Dissertation supervisor: Professor Jianlin Cheng. Includes vita. The raw information of a typical human genome was generated in 2001 by the Human Genome Project. However, because of the sheer volume of data, it remains a major challenge to understand it and to extract useful structural and functional information, such as the function of genes, the structure of the proteins they encode, and the function of those proteins. Understanding this information is crucial for improving longevity and quality of life, and has many applications, such as genomic medicine and drug design. Meanwhile, machine learning techniques are developing rapidly and are good at processing large datasets, but many of them have had limited impact on larger real-world problems. This thesis describes three major contributions. First, we generate a gene-gene interaction network from human genome conformation data captured by the Hi-C technique, and discover a relationship between gene function and gene-gene interaction. Second, we introduce a novel framework, SMISS, which uses a new source of information from the gene-gene interaction network and a new way of integrating different sources of information for protein function prediction. Finally, we introduce DeepQA, a tool that uses machine learning to evaluate the quality of a predicted protein structure, and MULTICOM, a method for protein structure prediction. All of these protein structure and function prediction methods are freely available to the scientific community as software and web servers. Includes bibliographical references (pages 150-168).

    ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network

    With the development of next-generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from them, because of the limitations of traditional biological experimental techniques. Protein function prediction has been a long-standing challenge in closing the gap between the huge number of protein sequences and the known functions. In this paper, we propose a novel method that converts the protein function prediction problem into a language translation problem, from the newly proposed protein sequence language "ProLan" to the protein function language "GOLan", and we build a neural machine translation model based on recurrent neural networks to translate "ProLan" into "GOLan". We blindly tested our method by participating in the third Critical Assessment of Function Annotation (CAFA 3) in 2016, and we also evaluated its performance on selected proteins whose functions were released after the CAFA competition. The good performance on the training and testing datasets demonstrates that our proposed method is a promising direction for protein function prediction. In summary, we propose, for the first time, a method that converts the protein function prediction problem into a language translation problem and applies a neural machine translation model to it. Comment: 13 pages, 5 figures

    Designing and Evaluating the MULTICOM Protein Local and Global Model Quality Prediction Methods in the CASP10 Experiment

    Background: Protein model quality assessment is an essential component of generating and using protein structural models. During the Tenth Critical Assessment of Techniques for Protein Structure Prediction (CASP10), we developed and tested four automated methods (MULTICOM-REFINE, MULTICOM-CLUSTER, MULTICOM-NOVEL, and MULTICOM-CONSTRUCT) that predicted both local and global quality of protein structural models. Results: MULTICOM-REFINE was a clustering approach that used the average pairwise structural similarity between models to measure the global quality and the average Euclidean distance between a model and several top ranked models to measure the local quality. MULTICOM-CLUSTER and MULTICOM-NOVEL were two new support vector machine-based methods of predicting both the local and global quality of a single protein model. MULTICOM-CONSTRUCT was a new weighted pairwise model comparison (clustering) method that used the weighted average similarity between models in a pool to measure the global model quality. Our experiments showed that the pairwise model assessment methods worked better when a large portion of models in the pool were of good quality, whereas single-model quality assessment methods performed better on some hard targets when only a small portion of models in the pool were of reasonable quality. Conclusions: Since digging out a few good models from a large pool of low-quality models is a major challenge in protein structure prediction, single model quality assessment methods appear to be poised to make important contributions to protein structure modeling. The other interesting finding was that single-model quality assessment scores could be used to weight the models by the consensus pairwise model comparison method to improve its accuracy
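The pairwise (clustering) idea described in this abstract can be illustrated with a minimal sketch. It assumes a precomputed matrix of pairwise structural similarities in [0, 1] between the models in a pool; the function name `consensus_global_scores` and the optional `weights` argument are hypothetical and are not taken from the MULTICOM software. The weighting shown here is only one plausible reading of how single-model scores might weight a consensus comparison.

```python
def consensus_global_scores(similarity, weights=None):
    """Score each model by its (weighted) average similarity to the other models.

    similarity: square list of lists; similarity[i][j] is an assumed,
                precomputed structural similarity between models i and j.
    weights:    optional per-model weights (for example, single-model
                quality scores); equal weights are used when omitted.
    """
    n = len(similarity)
    if weights is None:
        weights = [1.0] * n
    scores = []
    for i in range(n):
        numerator = 0.0
        denominator = 0.0
        for j in range(n):
            if j == i:
                continue  # a model is not compared with itself
            numerator += weights[j] * similarity[i][j]
            denominator += weights[j]
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


if __name__ == "__main__":
    pool = [
        [1.0, 0.8, 0.2],
        [0.8, 1.0, 0.4],
        [0.2, 0.4, 1.0],
    ]
    print(consensus_global_scores(pool))
    # Down-weight the third model, as if a single-model method rated it poorly.
    print(consensus_global_scores(pool, weights=[1.0, 1.0, 0.0]))
```

With equal weights, a model scores highly only if it resembles most of the pool, which is consistent with the abstract's observation that pairwise methods work best when many models in the pool are good.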

    SMOQ: A Tool for Predicting the Absolute Residue-Specific Quality of a Single Protein Model with Support Vector Machine

    Background: It is important to predict the quality of a protein structural model before its native structure is known. Methods that can predict the absolute local quality of individual residues in a single protein model are rare, yet particularly needed for using, ranking and refining protein models. Results: We developed a machine learning tool (SMOQ) that can predict the distance deviation of each residue in a single protein model. SMOQ uses support vector machines (SVM) with protein sequence and structural features (the basic feature set), including amino acid sequence, secondary structures, solvent accessibilities, and residue-residue contacts, to make predictions. We also trained an SVM model with two additional features (profiles and SOV scores) on 20 CASP8 targets and found that including them improves performance only when the real deviations between the native structure and the model are higher than 5 Å. The released version of SMOQ uses the basic feature set and was trained on 85 CASP8 targets. Moreover, SMOQ implements a way to convert predicted local quality scores into a global quality score. SMOQ was tested on the 84 CASP9 single-domain targets. The average difference between the residue-specific distance deviation predicted by our method and the actual distance deviation on the test data is 2.637 Å. The global quality prediction accuracy of the tool is comparable to that of other good tools on the same benchmark. Conclusions: SMOQ is a useful tool for protein single-model quality assessment. Its source code and executable are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/
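The two quantities mentioned in this abstract, the average difference between predicted and actual per-residue deviations and a global score derived from local scores, can be sketched as follows. This is an illustration under stated assumptions, not SMOQ's code: the function names are hypothetical, and the local-to-global formula (averaging 1 / (1 + (d / d0)^2) over residues with d0 = 5 Å) is an assumed, commonly used form that the abstract itself does not specify.

```python
def mean_absolute_difference(predicted, actual):
    """Average absolute gap between predicted and actual deviations (in Å)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def local_to_global(deviations, d0=5.0):
    """Collapse per-residue deviations into one score in (0, 1].

    Assumption: each residue contributes 1 / (1 + (d / d0)^2), so small
    deviations count close to 1 and large deviations close to 0.
    """
    if not deviations:
        raise ValueError("deviations must be non-empty")
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in deviations) / len(deviations)


if __name__ == "__main__":
    predicted = [1.0, 2.0, 3.0]  # hypothetical predicted deviations in Å
    actual = [1.5, 2.0, 5.0]     # hypothetical true deviations in Å
    print(mean_absolute_difference(predicted, actual))
    print(local_to_global(predicted))
```

The values in the usage section are made up for illustration and are unrelated to the CASP9 benchmark results reported above.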